10. Dealing with NaN
Dealing with NaN
Pandas 6 V1
As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type of bad data we encounter almost always is missing values. As we saw earlier, Pandas assigns
NaN
values to missing data. In this lesson we will learn how to detect and deal with
NaN
values.
We will begin by creating a DataFrame with some
NaN
values in it.
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]
# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
# We display the DataFrame
store_items
** bikes** glasses pants shirts shoes suits watches store 1 20 NaN 30 15.0 8 45.0 35 store 2 15 50.0 5 2.0 5 7.0 10 store 3 20 4.0 30 NaN 10 NaN 35
We can clearly see that the DataFrame we created has 3
NaN
values: one in store 1 and two in store 3. However, in cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of
NaN
values is not easily visualized. For these cases, we can use a combination of methods to count the number of
NaN
values in our data. The following example combines the
.isnull()
and the
sum()
methods to count the number of
NaN
values in our DataFrame
# We count the number of NaN values in store_items
x = store_items.isnull().sum().sum()
# We print x
print('Number of NaN values in our DataFrame:', x)
Number of NaN values in our DataFrame: 3
In the above example, the
.isnull()
method returns a
Boolean
DataFrame of the same size as
store_items
and indicates with
True
the elements that have
NaN
values and with
False
the elements that are not. Let's see an example:
store_items.isnull()
** bikes** glasses pants shirts shoes suits watches store 1 False True False False False False False store 2 False False False False False False False store 3 False False False True False True False
In Pandas, logical
True
values have numerical value 1 and logical
False
values have numerical value 0. Therefore, we can count the number of
NaN
values by counting the number of logical
True
values. In order to count the total number of logical
True
values we use the
.sum()
method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical
True
values along columns, as we see below:
store_items.isnull().sum()
bikes 0
glasses 1
pants 0
shirts 1
shoes 0
suits 1
watches 0
dtype: int64
The second sum will then add up the 1s in the above Pandas Series.
Instead of counting the number of
NaN
values we can also do the opposite, we can count the number of
non-NaN
values. We can do this by using the
.count()
method as shown below:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())
Number of non-NaN values in the columns of our DataFrame:
bikes 3
glasses 2
pants 3
shirts 2
shoes 3
suits 2
watches 3
dtype: int64
Now that we learned how to know if our dataset has any
NaN
values in it, the next step is to decide what to do with them. In general we have two options, we can either
delete
or
replace
the
NaN
values. In the following examples we will show you how to do both.
We will start by learning how to eliminate rows or columns from our DataFrame that contain any
NaN
values. The
.dropna(axis)
method eliminates any
rows
with
NaN
values when
axis = 0
is used and will eliminate any
columns
with
NaN
values when
axis = 1
is used. Let's see some examples
# We drop any rows with NaN values
store_items.dropna(axis = 0)
** bikes** glasses pants shirts shoes suits watches store 2 15 50.0 5 2.0 5 7.0 10
# We drop any columns with NaN values
store_items.dropna(axis = 1)
| | ** bikes** | pants | shoes | watches |
|----------|:-------------:|:------:|:------:|:------:|:------:|:------:|:------:|
| store 1 |20 | 30 | 8 | 35 |
| store 2 |15 | 5 | 5 | 10 |
| store 3 |20 | 30 | 10 | 35 |
Notice that the
.dropna()
method eliminates (drops) the rows or columns with
NaN
values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword
inplace = True
inside the
dropna()
function.
Now, instead of eliminating
NaN
values, we can replace them with suitable values. We could choose for example to replace all
NaN
values with the value 0. We can do this by using the
.fillna()
method as shown below.
# We replace all NaN values with 0
store_items.fillna(0)
** bikes** glasses pants shirts shoes suits watches store 1 20 0.0 30 15.0 8 45.0 35 store 2 15 50.0 5 2.0 5 7.0 10 store 3 20 4.0 30 0.0 10 0.0 35
We can also use the
.fillna()
method to replace
NaN
values with previous values in the DataFrame, this is known as
forward filling
. When replacing
NaN
values with forward filling, we can use previous values taken from columns or rows. The
.fillna(method = 'ffill', axis)
will use the forward filling (
ffill
) method to replace
NaN
values using the previous known value along the given
axis
. Let's see some examples:
# We replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0)
** bikes** glasses pants shirts shoes suits watches store 1 20 NaN 30 15.0 8 45.0 35 store 2 15 50.0 5 2.0 5 7.0 10 store 3 20 4.0 30 2.0 10 7.0 35
Notice that the two
NaN
values in
store 3
have been replaced with previous values in their columns. However, notice that the
NaN
value in
store 1
didn't get replaced. That's because there are no previous values in this column, since the
NaN
value is the first value in that column. However, if we do forward fill using the previous row values, this won't happen. Let's take a look:
# We replace NaN values with the previous value in the row
store_items.fillna(method = 'ffill', axis = 1)
** bikes** glasses pants shirts shoes suits watches store 1 20.0 20.0 30.0 15.0 8.0 45.0 35.0 store 2 15.0 50.0 5.0 2.0 5.0 7.0 10.0 store 3 20.0 4.0 30.0 30.0 10.0 10.0 35.0
We see that in this case all the
NaN
values have been replaced with the previous row values.
Similarly, you can choose to replace the
NaN
values with the values that go after them in the DataFrame, this is known as
backward filling
. The
.fillna(method = 'backfill', axis)
will use the backward filling (
backfill
) method to replace
NaN
values using the next known value along the given
axis
. Just like with forward filling we can choose to use row or column values. Let's see some examples:
# We replace NaN values with the next value in the column
store_items.fillna(method = 'backfill', axis = 0)
** bikes** glasses pants shirts shoes suits watches store 1 20 50.0 30 15.0 8 45.0 35 store 2 15 50.0 5 2.0 5 7.0 10 store 3 20 4.0 30 NaN 10 NaN 35
Notice that the
NaN
value in
store 1
has been replaced with the next value in its column. However, notice that the two
NaN
values in
store 3
didn't get replaced. That's because there are no next values in these columns, since these
NaN
values are the last values in those columns. However, if we do backward fill using the next row values, this won't happen. Let's take a look:
# We replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)
** bikes** glasses pants shirts shoes suits watches store 1 20.0 30.0 30.0 15.0 8.0 45.0 35.0 store 2 15.0 50.0 5.0 2.0 5.0 7.0 10.0 store 3 20.0 4.0 30.0 10.0 10.0 35.0 35.0
Notice that the
.fillna()
method replaces (fills) the
NaN
values out of place. This means that the original DataFrame is not modified. You can always replace the
NaN
values in place by setting the keyword
inplace = True
inside the
fillna()
function.
We can also choose to replace
NaN
values by using different interpolation methods. For example, the
.interpolate(method = 'linear', axis)
method will use
linear
interpolation to replace
NaN
values using the values along the given
axis
. Let's see some examples:
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)
** bikes** glasses pants shirts shoes suits watches store 1 20 NaN 30 15.0 8 45.0 35 store 2 15 50.0 5 2.0 5 7.0 10 store 3 20 4.0 30 2.0 10 7.0 35
Notice that the two
NaN
values in
store 3
have been replaced with linear interpolated values. However, notice that the
NaN
value in
store 1
didn't get replaced. That's because the
NaN
value is the first value in that column, and since there is no data before it, the interpolation function can't calculate a value. Now, let's interpolate using row values instead:
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)
** bikes** glasses pants shirts shoes suits watches store 1 20.0 25.0 30.0 15.0 8.0 45.0 35.0 store 2 15.0 50.0 5.0 2.0 5.0 7.0 10.0 store 3 20.0 4.0 30.0 20.0 10.0 22.5 35.0
Just as with the other methods we saw, the
.interpolate()
method replaces
NaN
values out of place.